Bernoulli 16(3), 2010, 605-613 
DOI: 10.3150/09-BEJ225 



Sharper lower bounds on the performance of 
the empirical risk minimization algorithm 

GUILLAUME LECUÉ^1 and SHAHAR MENDELSON^2

^1 CNRS, LATP, Marseille 13000, France. E-mail: lecue@latp.univ-mrs.fr

^2 Centre for Mathematics and Its Applications, The Australian National University, Canberra,
ACT 0200, Australia and Department of Mathematics, Technion, I.I.T., Haifa 32000, Israel.
E-mail: shahar.mendelson@anu.edu.au

We present an argument based on the multidimensional and the uniform central limit theorems,
proving that, under some geometrical assumptions between the target function $T$ and the
learning class $F$, the excess risk of the empirical risk minimization algorithm is lower bounded
by

\[
\frac{\mathbb{E}\sup_{q\in Q} G_q}{\sqrt{n}}\,\delta,
\]

where $(G_q)_{q\in Q}$ is a canonical Gaussian process associated with $Q$ (a well chosen subset of $F$)
and $\delta$ is a parameter governing the oscillations of the empirical excess risk function over a small
ball in $F$.

1. Introduction

In this note, we study lower bounds on the empirical minimization algorithm. To explain
the basic setup of this algorithm, let $(\Omega,\mu)$ be a probability space and set $X$ to be a
random variable taking values in $\Omega$, distributed according to $\mu$. We are interested in
the function learning (noiseless) problem, in which one observes $n$ independent random
variables $X_1,\ldots,X_n$, distributed according to $\mu$, and the values $T(X_1),\ldots,T(X_n)$ of an
unknown target function $T$.

The goal is to construct a procedure that uses the data $D=(X_i,T(X_i))_{1\le i\le n}$ with a
risk as close as possible to the best one in $F$. That is, we want to construct a statistic
$\hat f_n$ such that for every $n$, with high $\mu^n$-probability,

\[
R(\hat f_n \mid D) \le \inf_{f\in F} R(f) + r_n(F), \tag{1.1}
\]

where the risk of $f$ is defined by $R(f)=\mathbb{E}\ell(f(X),T(X))$ and $\ell:\mathbb{R}^2\to\mathbb{R}$ is the loss function
that measures the pointwise error between $T$ and $f$. The residue $r_n(F)$ somehow captures
the complexity or richness of the class $F$ and the risk of a statistic $\hat f$ is the conditional
expectation $R(\hat f \mid D)=\mathbb{E}(\ell(\hat f(X),T(X)) \mid D)$.

It is well known (see, e.g., [10]) that if the class $F$ is not too large, for example,
if it satisfies some kind of uniform central limit theorem, $T$ is bounded by 1 and $\ell$ is
reasonable, then there are upper bounds on $r_n(F)$ that are of the form $\sqrt{\mathrm{Comp}(F)/n}$,
where $\mathrm{Comp}(F)$ is a complexity term that is independent of $n$. The algorithm that is
used to produce the function $\hat f$ is the empirical risk minimization algorithm, in which one
chooses a function in $F$ that minimizes the empirical risk function $f\mapsto\sum_{i=1}^n \ell(f,T)(X_i)$
in $F$.
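
As a concrete illustration of the procedure just described, the following is a minimal sketch of empirical risk minimization over a finite class with the squared loss. The function names, the toy target $T(x)=x^2$ and the three candidate functions are hypothetical choices made only for this example; they do not come from the paper.

```python
import random

def squared_loss(prediction, target):
    """Pointwise squared loss (x - y)^2."""
    return (prediction - target) ** 2

def empirical_risk_minimizer(candidates, sample, target, loss=squared_loss):
    """Return the candidate with the smallest empirical risk on the sample.

    candidates: finite list of functions x -> float (a stand-in for the class F)
    sample:     the observed points X_1, ..., X_n
    target:     the target function T, evaluated only at the sample points
    """
    def empirical_risk(f):
        return sum(loss(f(x), target(x)) for x in sample) / len(sample)
    return min(candidates, key=empirical_risk)

# Hypothetical toy setup: X uniform on [0, 1], T(x) = x^2, three candidates.
rng = random.Random(0)
sample = [rng.random() for _ in range(200)]
target = lambda x: x * x
candidates = [lambda x: 0.0, lambda x: x, lambda x: x * x + 0.01]
f_hat = empirical_risk_minimizer(candidates, sample, target)
```

Dividing the sum by $n$ does not change the minimizer; it is kept only so that the quantity matches the empirical mean $P_n$ used later in the text.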

There is a well developed theory concerning ways in which the complexity term may be
controlled, using various parameters associated with the geometry of the class (cf. [2, 8-10]
and references therein). It turns out that this type of error rate, $\sim 1/\sqrt{n}$, is very
pessimistic in many cases. In fact, if the class is small enough, then, under some structural
assumptions (see, e.g., [1]), $r_n(F)$ can be much smaller - of the order of $\mathrm{Comp}(F)/n$.

In this note, we are going to focus on "small classes" in which empirical minimization
performs poorly, despite the size of the class. Recently, it has been shown (cf. [7]) that
under mild assumptions on $\ell$ and $F$, if there is more than a single function in

\[
V=\Bigl\{\ell(f,T):\ \mathbb{E}\ell(f,T)=\inf_{g\in F}\mathbb{E}\ell(g,T)\Bigr\},
\]

then the following holds: for every $n$ large enough, there will be a perturbation $T_n$ of
$T$ (with respect to the $L_\infty$-norm) for which $\mathbb{E}\ell(\cdot,T_n)$ has a unique minimizer in $F$, but
where the empirical minimization algorithm performs poorly trying to predict $T_n$ on
samples of cardinality $n$. To be more exact, relative to the target $T_n$, with $\mu^n$-probability
at least 1/12,

\[
R(\hat f \mid D)\ge\inf_{f\in F}R(f)+\frac{c}{\sqrt{n}}, \tag{1.2}
\]

where $c$ is a constant depending only on $F$.

Although it is reasonable to expect that the larger the set V is, the more likely it is 
that the empirical minimization algorithm will perform poorly, this does not follow from
the analysis in [7]. Therefore, our goal here is to provide a bound on the constant c in 
(1.2) that does take into account the complexity of the set of minimizers V . 

Just as in [7] , our method of analysis can be applied to a wide variety of losses. However, 
for the sake of simplicity, we will only present here what is arguably the most important 
case - that in which the risk is measured relative to the squared loss, $\ell(x,y)=(x-y)^2$.

To explain our result, we need several definitions from empirical processes theory. 
Other standard notions we require from the theory of Gaussian processes can be found 
in [2]. 

For every set $F\subset L_2(\Omega,\mu)$, let $\{G_f: f\in F\}$ be the canonical Gaussian process indexed
by $F$ (i.e., with the covariance structure $\mathbb{E}G_sG_t=\langle s,t\rangle$) and set $H(F)=\mathbb{E}\sup_{f\in F}G_f$ -
the expectation of the supremum of the Gaussian process indexed by $F$. Also, for every
integer $n$ and $\delta>0$, let

\[
\mathrm{osc}_n(F,\delta) := \frac{1}{\sqrt{n}}\,\mathbb{E}\sup_{\{f,h\in F:\ \|f-h\|\le\delta\}}
\Bigl|\sum_{i=1}^n g_i(f-h)(X_i)\Bigr|,
\]

where $(g_i)_{i=1}^n$ are standard, independent Gaussian random variables and $(X_i)_{i=1}^n$ are
independent, distributed according to $\mu$. It is well known that if $F$ is a class consisting of
uniformly bounded functions, then it is a $\mu$-Donsker class if and only if for every $\delta>0$,
$\mathrm{osc}_n(F,\delta)$ tends to $0$ as $n$ tends to infinity (cf. [2], page 301). For any $f\in F$, let

\[
\mathrm{osc}_n(F,f,\delta) := \frac{1}{\sqrt{n}}\,\mathbb{E}\sup_{\{h\in F:\ \|f-h\|\le\delta\}}
\Bigl|\sum_{i=1}^n g_i(f-h)(X_i)\Bigr|,
\]

that is, the oscillation in a ball around $f$. The quantity $\mathrm{osc}_n(F,f^*,\delta)$ is a natural upper
bound for some intrinsic quantity of the problem we study here (cf. Lemma 2.3).

Let $V$ be as above - the set of loss functions $\ell(f,T)$ that minimize the risk in $F$ - select
$f^*\in F$ for which $\ell(f^*,T)\in V$ and consider the following subset of excess loss functions:

\[
Q:=\{\ell(f,T)-\ell(f^*,T):\ \ell(f,T)\in V\}.
\]
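
To make the local oscillation concrete, here is a minimal Monte Carlo sketch for a finite class on a finite probability space. Everything in it (the function names, the two-point space, the two functions) is a hypothetical setup chosen for illustration, and it assumes the center $f$ belongs to the class so that the ball is never empty.

```python
import math
import random

def l2_distance(f, h, points, probs):
    """L2(mu) distance when mu puts mass probs[k] on points[k]."""
    return math.sqrt(sum(p * (f(x) - h(x)) ** 2 for x, p in zip(points, probs)))

def local_oscillation(F, f, delta, n, points, probs, trials=2000, seed=0):
    """Monte Carlo estimate of osc_n(F, f, delta) for a finite class F containing f."""
    rng = random.Random(seed)
    # Functions of F within L2(mu)-distance delta of f.
    ball = [h for h in F if l2_distance(f, h, points, probs) <= delta]
    total = 0.0
    for _ in range(trials):
        xs = rng.choices(points, weights=probs, k=n)   # X_1, ..., X_n
        gs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # Gaussian multipliers
        total += max(abs(sum(g * (f(x) - h(x)) for g, x in zip(gs, xs)))
                     for h in ball)
    return total / (trials * math.sqrt(n))

# Hypothetical example: two functions at L2 distance 1 on a two-point space.
points, probs = [0, 1], [0.5, 0.5]
f_star = lambda x: 0.0
h = lambda x: 1.0
isolated = local_oscillation([f_star, h], f_star, 0.5, n=50, points=points, probs=probs)
crowded = local_oscillation([f_star, h], f_star, 2.0, n=50, points=points, probs=probs)
```

When the radius is smaller than the distance to every other function, the ball reduces to the center and the estimate is zero, which is the situation of an isolated oracle discussed below.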

It turns out that the desired constant in (1.2) can be bounded from below by two
parameters: the expectation of the supremum of the canonical Gaussian process indexed
by $Q$ and the oscillation around $f^*$. In particular, if $Q$ is a rich set and one of the
minimizers of $f\mapsto\mathbb{E}\ell(f,T)$ is isolated, then for any $n$ large enough, the error of the
empirical minimizer with respect to a wisely selected target (denoted by $T_{\lambda_n}$ in what
follows) which is a perturbation of $T$ will be at least of the order of $H(Q)/\sqrt{n}$. The core idea of this
work is that a small, wisely chosen perturbation of a target function $T$ with multiple
oracles (functions achieving $\min_{f\in F}\mathbb{E}\ell(f,T)$) is badly estimated by the empirical risk
minimization procedure (for further discussion of this fact, we refer the reader to [7]).
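
The phenomenon can be seen in a very small simulation. The sketch below uses a hypothetical two-point example that is not taken from the paper: $X$ is uniform on $\{0,1\}$, $T=0$, and the class consists of the two indicators $f_1=1_{\{0\}}$ and $f_2=1_{\{1\}}$, which are both oracles of $T$. The target is then perturbed toward $f_1$, that is, $T_\lambda=(1-\lambda)T+\lambda f_1$ with $\lambda=c/\sqrt{n}$, so that $f_1$ becomes the unique best approximation and choosing $f_2$ costs an excess risk of exactly $\lambda$.

```python
import math
import random

def erm_picks_wrong_oracle(n, lam, rng):
    """One ERM run in the hypothetical two-point example.

    X is uniform on {0, 1}, T = 0, f1 = 1{x = 0}, f2 = 1{x = 1}.
    The perturbed target is T_lam = lam * f1, so T_lam(0) = lam, T_lam(1) = 0.
    Returns True when ERM selects f2, whose excess risk equals lam.
    """
    ones = bin(rng.getrandbits(n)).count("1")  # number of X_i equal to 1
    zeros = n - ones                           # number of X_i equal to 0
    # Empirical squared-loss risks of f1 and f2 against T_lam.
    risk_f1 = zeros * (1.0 - lam) ** 2 / n
    risk_f2 = (zeros * lam ** 2 + ones) / n
    return risk_f2 < risk_f1

def failure_rate(n, c=0.5, trials=2000, seed=0):
    """Fraction of runs in which ERM returns f2 when lam = c / sqrt(n)."""
    rng = random.Random(seed)
    lam = c / math.sqrt(n)
    return sum(erm_picks_wrong_oracle(n, lam, rng) for _ in range(trials)) / trials
```

In this toy example a normal approximation suggests that the failure rate stays close to $\Pr(N(0,1)>c)$ as $n$ grows, so with probability bounded away from zero the empirical minimizer pays an excess risk of order $1/\sqrt{n}$. This is only an illustration of the mechanism, not a substitute for the argument of the paper.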

Although the general philosophy of the proof presented here is similar to the proof 
from [7], it is much simpler. And, in fact, it seems that the method used in the proof 
from [7] cannot be directly extended to obtain the sharper estimate on the constant as 
we do here. Naturally, this result recovers the previous estimates on lower bounds for the 
empirical risk minimization algorithm from [3-6].

Next, a word about notation. Throughout, all absolute constants will be denoted by 
$c$, $c_1$ and $C$, $C_1$, etcetera. Their values may change from line to line.

If $\mathbb{E}\ell(\cdot,T)$ has a unique minimizer in $F$, then we denote it by $f^*$. If the minimizer is not
unique, then we will fix one function in the set of minimizers and denote it by $f^*$. For every
$f\in F$, let $\mathcal{L}(f)=\ell(f,T)-\ell(f^*,T)$ be the excess loss function associated with the target
$T$. For every $0<\lambda<1$, set $T_\lambda=(1-\lambda)T+\lambda f^*$ and define $\mathcal{L}_\lambda(f)=\ell(f,T_\lambda)-\ell(f^*,T_\lambda)$.
It is standard to verify (cf. [7] or Theorem 2.1 in what follows) that $f^*$ is a minimizer
of $\mathbb{E}\ell(\cdot,T_\lambda)$ and that under mild convexity assumptions on $\ell$ that clearly hold if $\ell$ is the
squared loss, it is the unique minimizer in $F$ of $f\mapsto\mathbb{E}\ell(f,T_\lambda)$.
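
For the squared loss, the uniqueness claim can be checked by a direct computation. The derivation below is a sketch added for the reader's convenience; it uses only the definitions above, together with the decomposition $f-T_\lambda=(f-f^*)+(1-\lambda)(f^*-T)$ and the fact that $\mathbb{E}\mathcal{L}(f)\ge 0$ because $f^*$ minimizes the risk relative to $T$.

```latex
\begin{align*}
\mathbb{E}\mathcal{L}_\lambda(f)
  &= \|f-T_\lambda\|^2-\|f^*-T_\lambda\|^2 \\
  &= \|f-f^*\|^2+2(1-\lambda)\langle f-f^*,\,f^*-T\rangle \\
  &= (1-\lambda)\,\mathbb{E}\mathcal{L}(f)+\lambda\|f-f^*\|^2
   \;\ge\; \lambda\|f-f^*\|^2 .
\end{align*}
```

In particular, $\mathbb{E}\mathcal{L}_\lambda(f)>0$ for every $f\in F$ that differs from $f^*$ in $L_2(\mu)$, so $f^*$ is the only minimizer relative to the perturbed target.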

If $X_1,\ldots,X_n$ is an independent sample selected according to $\mu$, we set
$P_nf=n^{-1}\sum_{i=1}^nf(X_i)$ and let $Pf=\mathbb{E}f$. Thus, $\mathbb{E}\sup_{f\in F}|(P_n-P)(f)|$ is the expectation of the
supremum of the empirical process indexed by $F$. Finally, when the target function is
$T_\lambda$, we will denote the function produced by the empirical risk minimization algorithm
by $\hat f_\lambda$ - which is one element of the set $\mathrm{Argmin}_{f\in F}P_n\ell(f,T_\lambda)$.

Finally, if $E$ is a normed space, we denote its unit ball by $B(E)$, the inner product of
$L_2(\mu)$ will be denoted by $\langle\cdot,\cdot\rangle$ and the corresponding norm by $\|\cdot\|$.

Let us now formulate our main result. 

Theorem 1.1. Let $F\subset L_2(\mu)\cap B(L_\infty)$, which is $\mu$-pre-Gaussian (cf. [2]), and assume
that $T\in B(L_\infty)$. Set $\ell$ to be the squared loss and put $Q=\{\mathcal{L}(f):\ f\in F,\ \mathbb{E}\mathcal{L}(f)=0\}$.

There exist some absolute constants $c_1$ and $c_2$ and an integer $N(F)$ for which the
following holds. For every $n\ge N(F)$, with $\mu^n$-probability at least $c_1$,

\[
\mathbb{E}[\mathcal{L}_{\lambda_n}(\hat f_{\lambda_n}) \mid D]\ \ge\ c_2
\Bigl(\frac{\delta\,\|T-f^*\|}{\sup_{f\in F}\|T-f\|}\Bigr)^2\frac{H(Q)}{\sqrt{n}},
\]

where $\delta$ is such that for every integer $n\ge N(F)$, $\mathrm{osc}_n(F,f^*,\delta)\le c_2H(Q)/\sqrt{n}$ and
$\lambda_n=c_2H(Q)/\sqrt{n}$.

Thus, two parameters control the behavior of the constant in (1.2): the complexity of
the set of excess loss functions of the oracles of $T$ and the parameter $\delta$. When one of the
oracles $f^*$ of $T$ is isolated, one can take $\delta$ as an absolute constant. This leads to a lower
bound of the order of $H(Q)/\sqrt{n}$, which is optimal in the sense that an upper bound can
be obtained of the order of $H(Q_0)/\sqrt{n}$ for some set $Q_0$ such that $Q\subset Q_0\subset\mathcal{L}_F$ (see, e.g.,
[1] or [3]). In other settings, the lower bound obtained in Theorem 1.1 may fail to match
exactly with an upper bound. For instance, in settings where the oscillation function
$\mathrm{osc}_n(F,f^*,\cdot)$ of all the oracles $f^*$ of $T$ decreases to zero very slowly and at the same
convergence rate, the factor $\delta^2$ should break down the lower bound, whereas it seems
that it should not appear in the lower bound. From a technical point of view, this comes
from the fact that we did not take into account the complexity "around" the points in
$Q'$ (cf. Theorem 2.2 and equation (2.2) in what follows).

Finally, the noiseless model considered here is the worst case scenario to prove the
lower bound. Indeed, adding some noise to the target function would increase the lower
bound.

2. The lower bound

The core of the proof is to find a set that can "compete" with a set $B_r=\{f\in F:\ \mathbb{E}\mathcal{L}_\lambda(f)\le r\}$
that contains $f^*$, in the sense that the empirical excess risk function $f\mapsto P_n\mathcal{L}_\lambda(f)$
will be more negative on the set than it can possibly be on $B_r$. Once this task is achieved,
it is obvious that the empirical risk minimization algorithm will produce a function $\hat f_\lambda$
which is outside $B_r$ and, thus, with a certain probability,

\[
\mathbb{E}[\mathcal{L}_\lambda(\hat f_\lambda) \mid D]\ge r.
\]

Hence, the proof consists of two parts. First, we will show that the empirical excess
risk function $f\mapsto P_n\mathcal{L}_{\lambda_n}(f)$ is likely to be very negative on $Q$ and we will then find some $r$ on which
the oscillations in $B_r$ are small.

The first result we need is the following lower estimate on the expectation of the
excess loss relative to the target $T_\lambda=(1-\lambda)T+\lambda f^*$, according to the distance of $f$ from
$f^*$. This proposition is based on the fact that the functional $(f,g)\mapsto\mathbb{E}\ell(f,g)$ inherits
a strong convex structure from the norm and was proven in [7] in a far more general
situation.

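Theorem 2.1. Let $D=\sup_{f\in F}\|T-f\|$ and $\rho=\|T-f^*\|$. There exists an absolute
constant $c$ such that for any function $f\in F$, if $0<\lambda<1/2$, $r>0$ and

\[
\frac{r}{\lambda}\le c\,\frac{\rho^2}{D^2}\,\|f-f^*\|^2,
\]

then

\[
r\le\mathbb{E}\mathcal{L}_\lambda(f).
\]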

Recall that $Q=\{\mathcal{L}(f):\ \mathbb{E}\mathcal{L}(f)=0,\ f\in F\}$ is the set of excess loss functions associated
with the true minimizers of $f\mapsto\mathbb{E}\ell(f,T)$ in $F$. We will show that if $Q'\subset Q$ is a finite set,
then for $n$ large enough, with a non-trivial $\mu^n$-probability there will be some $\mathcal{L}(f)\in Q'$
for which the empirical error $P_n\mathcal{L}_{\lambda_n}(f)$ is very negative (for a well chosen $\lambda_n$).

Theorem 2.2. There exist constants $c_1$, $c_2$ and $c_3$, depending only on the $L_\infty(\mu)$-diameter
of $F\cup\{T\}$, for which the following holds. If $Q'$ is a finite subset of $Q$ that
contains 0, then there exists an integer $n_0=n_0(Q')$ such that for every integer $n\ge n_0$,
with $\mu^n$-probability at least $c_1$,

\[
\inf_{\{f:\ \mathcal{L}(f)\in Q'\}}\frac{1}{n}\sum_{i=1}^n(\mathcal{L}_{\lambda_n}(f))(X_i)\le-c_2\frac{H(Q')}{\sqrt{n}},
\]

where $\lambda_n=c_3H(Q')/\sqrt{n}$ and $H(Q')=\mathbb{E}\sup_{q\in Q'}G_q$ is the expectation of the supremum of the canonical
Gaussian process associated with $Q'$.

Proof. Let $M=|Q'|$ and recall that each $q\in Q'=\{q_1,\ldots,q_M\}$ has mean zero. Consider
the random vector $U=(q_1(X),\ldots,q_M(X))\in\mathbb{R}^M$ and let $(U_i)_{i=1}^\infty$ be independent
copies of $U$ (i.e., $U_i=(q_1(X_i),\ldots,q_M(X_i))$). By the vector-valued central limit theorem
(see, e.g., [2]), $n^{-1/2}\sum_{i=1}^nU_i$ converges weakly to the canonical Gaussian process indexed
by $Q'$, which we denote by $G$. Fix $t<0$ and $0<c<1$, to be given later, for which

\[
A_t=\{x\in\mathbb{R}^M:\ \forall 1\le j\le M,\ x_j\ge t\}
\]

is such that $p:=\Pr(G\in A_t)\le c$. Set $n_0=n_0(t,c)$ to be such that for $n\ge n_0$,

\[
\Bigl|\Pr(G\in A_t)-\Pr\Bigl(n^{-1/2}\sum_{i=1}^nU_i\in A_t\Bigr)\Bigr|\le\frac{1-p}{2},
\]

which clearly exists by weak convergence. Since

\[
\Pr\Bigl(\exists 1\le j\le M:\ n^{-1/2}\sum_{i=1}^n\langle U_i,e_j\rangle<t\Bigr)
=1-\Pr\Bigl(n^{-1/2}\sum_{i=1}^nU_i\in A_t\Bigr)
\ge\frac{1-p}{2}\ge\frac{1-c}{2}=:c_1,
\]

it follows that, with probability at least $c_1$,

\[
\inf_{q\in Q'}\frac{1}{n}\sum_{i=1}^nq(X_i)\le\frac{t}{\sqrt{n}}.
\]

It remains to show that one may take $t=-(\mathbb{E}\sup_{q\in Q'}G_q)/4$. Indeed, by the symmetry
of the Gaussian process, it follows that (for this choice of $t$)

\[
p=\Pr(G\in A_t)=\Pr\Bigl(\sup_{q\in Q'}G_q\le\mathbb{E}\sup_{q\in Q'}G_q/4\Bigr).
\]

Let $Z=\sup_{q\in Q'}G_q$ and $\sigma^2=\sup_{q\in Q'}\mathbb{E}G_q^2$. Since $0\in Q'$, it follows that if $\mathbb{E}Z=0$, then it
is clear that $p=1/2$. Otherwise, using the concentration property of $Z$ around its mean
(see, e.g., [9]) and since $\sigma\le c_0\mathbb{E}Z$ (where $c_0$ is an absolute constant), there exists an
absolute constant $A>0$ such that

\[
\mathbb{E}[Z1_{[Z>\mathbb{E}Z+A\sigma]}]\le(\mathbb{E}Z)/4.
\]

Therefore,

\begin{align*}
\mathbb{E}Z&=\mathbb{E}\bigl(Z(1_{[Z\le(\mathbb{E}Z)/4]}+1_{[(\mathbb{E}Z)/4<Z\le\mathbb{E}Z+A\sigma]}+1_{[Z>\mathbb{E}Z+A\sigma]})\bigr)\\
&\le(\mathbb{E}Z)/2+(\mathbb{E}Z)(1+c_0A)\Pr((\mathbb{E}Z)/4<Z).
\end{align*}

Thus, $\Pr((\mathbb{E}Z)/4<Z)\ge[2(1+c_0A)]^{-1}$ and so $p\le1-[2(1+c_0A)]^{-1}:=c$ (which is an
absolute constant), implying that, with probability greater than $c_1$,

\[
\inf_{\{f:\ \mathcal{L}(f)\in Q'\}}\frac{1}{n}\sum_{i=1}^n(\mathcal{L}(f))(X_i)\le-c_2\frac{\mathbb{E}\sup_{q\in Q'}G_q}{\sqrt{n}}.
\]
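
Next, observe that for small values of $\lambda$ (as we will have in our construction), $\mathcal{L}(f)$
is a good approximation of $\mathcal{L}_\lambda(f)$ with respect to the $L_\infty(\mu)$-norm. Indeed, $\mathcal{L}_\lambda(f)=
\ell(f,T_\lambda)-\ell(f^*,T_\lambda)$ and $\mathcal{L}(f)=\ell(f,T)-\ell(f^*,T)$; hence, for every $f\in F$,

\begin{align*}
\|\mathcal{L}_\lambda(f)-\mathcal{L}(f)\|_\infty&\le\|\ell(f,T_\lambda)-\ell(f,T)\|_\infty+\|\ell(f^*,T_\lambda)-\ell(f^*,T)\|_\infty\\
&\le2\|\ell\|_{\mathrm{lip}}\|T-T_\lambda\|_\infty=2\lambda\|\ell\|_{\mathrm{lip}}\|T-f^*\|_\infty\le c_3\lambda.
\end{align*}

Thus, if one selects $\lambda_n=(c_2/(2c_3))n^{-1/2}\,\mathbb{E}\sup_{q\in Q'}G_q$, then, with probability greater
than $c_1$,

\[
\inf_{\{f:\ \mathcal{L}(f)\in Q'\}}P_n\mathcal{L}_{\lambda_n}(f)\le-c_2\frac{\mathbb{E}\sup_{q\in Q'}G_q}{2\sqrt{n}}. \qquad\Box
\]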
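
The last step of the proof can be written out explicitly. The lines below are only a restatement of the two estimates used there, assuming they read $P_n\mathcal{L}(f)\le-c_2H(Q')/\sqrt{n}$ for some $f$ with $\mathcal{L}(f)\in Q'$ and $\|\mathcal{L}_\lambda(f)-\mathcal{L}(f)\|_\infty\le c_3\lambda$, with $H(Q')=\mathbb{E}\sup_{q\in Q'}G_q$ and $\lambda_n=(c_2/(2c_3))\,n^{-1/2}H(Q')$.

```latex
\begin{align*}
P_n\mathcal{L}_{\lambda_n}(f)
  &\le P_n\mathcal{L}(f)+\|\mathcal{L}_{\lambda_n}(f)-\mathcal{L}(f)\|_\infty \\
  &\le -c_2\frac{H(Q')}{\sqrt{n}}+c_3\lambda_n
   = -c_2\frac{H(Q')}{\sqrt{n}}+\frac{c_2}{2}\,\frac{H(Q')}{\sqrt{n}}
   = -\frac{c_2}{2}\,\frac{H(Q')}{\sqrt{n}} .
\end{align*}
```

The choice of $\lambda_n$ is thus exactly the one that makes the perturbation cost at most half of the negative deviation obtained from the central limit theorem.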

Fix a finite set $Q'\subset Q$ for which $H(Q')\ge H(Q)/2$ and $0\in Q'$. Clearly, such a set exists
because $Q$ is pre-Gaussian as a subset of the pre-Gaussian class $\{\mathcal{L}(f):\ f\in F\}$. Let
$V'=\{f\in F:\ \mathcal{L}(f)\in Q'\}$.
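
For a finite set such as $Q'$, the quantity $H(Q')$ can be approximated numerically. The following is a minimal Monte Carlo sketch under the simplifying assumption that $\mu$ is supported on finitely many points; the function name and the two-element example are hypothetical and serve only as an illustration.

```python
import math
import random

def gaussian_sup_mean(functions, points, probs, trials=20000, seed=0):
    """Monte Carlo estimate of H = E sup_f G_f for a finite class on a finite space.

    With mu putting mass probs[k] on points[k], the process
    G_f = sum_x sqrt(mu(x)) * f(x) * g_x (g_x i.i.d. standard normal)
    has covariance E G_s G_t = <s, t> in L2(mu), i.e. it is the canonical process.
    """
    rng = random.Random(seed)
    # One row per function, with coordinates sqrt(mu(x)) * f(x).
    rows = [[math.sqrt(p) * f(x) for x, p in zip(points, probs)] for f in functions]
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in points]
        total += max(sum(a * b for a, b in zip(row, g)) for row in rows)
    return total / trials

# Hypothetical example: Q' = {0, q} on a two-point space, q mean zero with norm 1.
points, probs = [0, 1], [0.5, 0.5]
zero = lambda x: 0.0
q = lambda x: 1.0 if x == 1 else -1.0
H_est = gaussian_sup_mean([zero, q], points, probs)
```

In this two-element example the exact value is $\mathbb{E}\max(0,g)=1/\sqrt{2\pi}$ for a standard Gaussian $g$, which gives a simple sanity check for the estimator.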

Recall that a bounded class of functions $F$ is $\mu$-Donsker if and only if for every $u>0$,
there exist $\delta>0$ and an integer $n_0$ such that for every $n\ge n_0$, $\mathrm{osc}_n(F,\delta)\le u$. Also, note
that $\mathrm{osc}_n(F,f^*,\delta)\le\mathrm{osc}_n(F,\delta)$. Let $u=\eta H(Q')$, where $\eta$ is an absolute constant, to be
fixed later, and set $\delta$ and $n_1$ to be such that for $n\ge n_1$,

\[
\mathrm{osc}_n(F,f^*,\delta)\le\eta H(Q') \tag{2.1}
\]

(such $\delta$ and $n_1$ necessarily exist because $F$ is $\mu$-Donsker).

The next lemma is standard and follows from a symmetrization argument combined 
with Slepian's lemma. Its proof may be found in, for example, [7]. 

Lemma 2.3. There exists an absolute constant $c$ for which the following holds. For any
$F'\subset F$ such that $f^*\in F'$ and any $0<\lambda<1$,

\[
\mathbb{E}\sup_{f\in F'}|(P-P_n)(\mathcal{L}_\lambda(f))|\le c\,\mathbb{E}\sup_{f\in F'}
\Bigl|\frac{1}{n}\sum_{i=1}^ng_i(f-f^*)(X_i)\Bigr|,
\]

where $(g_i)_{i=1}^n$ are independent, standard Gaussian variables.

We are now ready to control the oscillation of the empirical excess risk function in the
set $B_r=\{f\in F:\ \mathbb{E}\mathcal{L}_\lambda(f)\le r\}$.

Theorem 2.4. Let $c_1$, $c_2$ and $\lambda_n$ be defined as in Theorem 2.2, and let $\delta$ and $n_1$ be
as above. There exists an absolute constant $c_3$ such that for any integer $n\ge n_1$, with
$\mu^n$-probability at least $1-c_1/2$,

\[
\inf_{\{f\in F:\ \mathbb{E}\mathcal{L}_{\lambda_n}(f)\le r_n\}}P_n\mathcal{L}_{\lambda_n}(f)\ge-\frac{c_2H(Q')}{2\sqrt{n}},
\]

where $r_n=c_3(\delta\rho/D)^2H(Q')/\sqrt{n}$.

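Proof. By Theorem 2.1, for any $r,\lambda>0$, if $f\in F$ is such that $\mathbb{E}\mathcal{L}_\lambda(f)\le r$, then

\[
\frac{r}{\lambda}\ge c\,\frac{\rho^2}{D^2}\,\|f-f^*\|^2,
\]

where $D$ and $\rho$ were defined in Theorem 2.1. Thus,

\[
\{f\in F:\ \mathbb{E}\mathcal{L}_\lambda(f)\le r\}\subset\{f\in F:\ \|f-f^*\|\le c_4\sqrt{r/\lambda}\},
\]

where $c_4=c_4(\rho,D)$. Hence, by Lemma 2.3, for $n\ge n_1$,

\begin{align*}
\mathbb{E}\sup_{\{f\in F:\ \mathbb{E}\mathcal{L}_\lambda(f)\le r\}}-P_n\mathcal{L}_\lambda(f)
&\le c_5\,\mathbb{E}\sup_{\{f\in F:\ \|f-f^*\|\le c_4\sqrt{r/\lambda}\}}\Bigl|\frac{1}{n}\sum_{i=1}^ng_i(f-f^*)(X_i)\Bigr|\\
&\le\frac{c_5}{\sqrt{n}}\,\mathrm{osc}_n(F,f^*,c_4\sqrt{r/\lambda})\le\frac{c_5}{\sqrt{n}}\,\eta H(Q'),
\end{align*}

provided that $c_4\sqrt{r/\lambda}\le\delta$. Thus, for an appropriate choice of $\eta$ (e.g., $\eta=c_1c_2/(4c_5)$
would do) and setting $r_n=(c_3/(2c_4^2))n^{-1/2}H(Q')\delta^2$ (which is smaller than $\delta^2\lambda_n/c_4^2$), it
is evident that

\[
\mathbb{E}\sup_{\{f\in F:\ \mathbb{E}\mathcal{L}_{\lambda_n}(f)\le r_n\}}-P_n\mathcal{L}_{\lambda_n}(f)\le\frac{c_1c_2H(Q')}{4\sqrt{n}}.
\]

The supremum is nonnegative because $f^*$ belongs to the set and $P_n\mathcal{L}_{\lambda_n}(f^*)=0$. Therefore, by
Markov's inequality, with $\mu^n$-probability at least $1-c_1/2$,

\[
\sup_{\{f\in F:\ \mathbb{E}\mathcal{L}_{\lambda_n}(f)\le r_n\}}-P_n\mathcal{L}_{\lambda_n}(f)\le\frac{c_2H(Q')}{2\sqrt{n}},
\]

as claimed. □

We can now prove our main result.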



Proof of Theorem 1.1. By Theorem 2.2 applied to the set $Q'$, there exists some integer
$n_0=n_0(Q')$ such that for every $n\ge n_0$, with $\mu^n$-probability at least $c_1$,

\[
\inf_{f\in V'}P_n\mathcal{L}_{\lambda_n}(f)\le-c_2\frac{H(Q')}{\sqrt{n}}, \tag{2.2}
\]

where $c_1$ and $c_2$ are two absolute constants.

By Theorem 2.4, for any integer $n\ge n_1$, with $\mu^n$-probability at least $1-c_1/2$,

\[
\inf_{\{f\in F:\ \mathbb{E}\mathcal{L}_{\lambda_n}(f)\le r_n\}}P_n\mathcal{L}_{\lambda_n}(f)\ge-\frac{c_2H(Q')}{2\sqrt{n}}. \tag{2.3}
\]

Hence, combining equations (2.2) and (2.3), with $\mu^n$-probability at least $c_1/2$, the
empirical excess risk of $\hat f_{\lambda_n}$ satisfies $P_n\mathcal{L}_{\lambda_n}(\hat f_{\lambda_n})\le-c_2H(Q')/\sqrt{n}$, while for every
function $f\in F$ with $\mathbb{E}\mathcal{L}_{\lambda_n}(f)\le r_n$, the empirical excess risk satisfies $P_n\mathcal{L}_{\lambda_n}(f)\ge
-c_2H(Q')/(2\sqrt{n})$. Therefore, the empirical risk minimization algorithm has an excess
risk (conditionally on the data $D$) larger than $r_n$, with probability greater than $c_1/2$, as
claimed. □